Informative Plots

Time Series

Since we do not see a trend, we consider the time series stationary (i.e. their statistical properties do not depend on the time of the observation).

The plot below does not show a specific seasonal pattern, so search interest in these topics does not appear to be related to the time of the year. However, there are some large peaks at certain times of the year, and the majority seem to be linked to the British Royal family. These peaks could be due to independent events, such as news about the well-being of the Queen or the release of the TV series The Crown, which could spark people’s interest in the royal family. We can also see that searches for the Summer Olympics are high during the build-up to the event and while it takes place. Once the Olympic Games ended, the searches fell drastically, as expected.

Bar plot

We have made a bar plot depicting the frequency of views and searches on a given topic/type.

Modified Series

\[ R_i = \frac{X_i - X_{i-7}}{X_{i-7}} \times 100 \]

where \(X_i\) is the count on day \(i\) and \(R_i\) is the relative weekly change (increase or decrease) in percentage.
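As an illustration of the formula, here is a minimal Python sketch (not the R code used for this report). The function name, the handling of a zero denominator, and the sample counts are assumptions made for the example.

```python
def relative_weekly_change(counts, lag=7):
    """Return R_i = (X_i - X_{i-lag}) / X_{i-lag} * 100 for every i >= lag.

    Days whose reference count X_{i-lag} is zero yield None, because the
    ratio is undefined there (a choice made for this sketch only).
    """
    changes = []
    for i in range(lag, len(counts)):
        previous = counts[i - lag]
        if previous == 0:
            changes.append(None)
        else:
            changes.append((counts[i] - previous) / previous * 100)
    return changes


# Hypothetical daily view counts over two weeks (not data from the report).
daily_counts = [100, 120, 90, 110, 105, 95, 130,
                150, 120, 45, 110, 210, 95, 65]
weekly_change = relative_weekly_change(daily_counts)
print(weekly_change)  # [50.0, 0.0, -50.0, 0.0, 100.0, 0.0, -50.0]
```

The first seven days have no value because there is no observation one week earlier to compare with.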

Plotting the relative weekly change shows the highs and lows more clearly. There is a massive peak in the traffic count in April 2016 for Princess Margaret and another large, though smaller, peak for Winston Churchill in February 2016.

Peaks-Over-Threshold

For each time series (\(R_i,\ i = 1, ..., 12\)), we plot the distribution of its daily counts on a Q-Q plot to assess whether the data come from a theoretical distribution (e.g. Normal). A Q-Q plot compares the quantiles of a given sample with the quantiles of the normal distribution, so it gives a visual check of the normality assumption. In ggplot2, the reference line is drawn with geom_qq_line(…, distribution = stats::qnorm). If the assumption does not hold, we look into which data points contribute to the violation and how.
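To make the idea behind a normal Q-Q plot concrete, here is a minimal Python sketch that computes the plotted pairs by hand. It is an illustration and not the ggplot2 code used for the report. The plotting positions (k - 0.5) / n, the use of a normal distribution fitted to the sample mean and standard deviation, and the sample values are all assumptions of the example.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Pair each ordered observation with its theoretical normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    # Reference normal with the same mean and standard deviation as the sample.
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (k - 0.5) / n keep the probabilities strictly in (0, 1).
    theoretical = [reference.inv_cdf((k - 0.5) / n) for k in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical sample with one very large value (not data from the report).
sample = [-12.0, -5.0, -3.0, 0.0, 1.0, 2.0, 4.0, 6.0, 9.0, 150.0]
points = normal_qq_points(sample)
largest_theoretical, largest_observed = points[-1]

# A positive gap in the upper tail means the largest observation lies above
# the line y = x, which is what a right-skewed sample looks like on a Q-Q plot.
print(largest_observed - largest_theoretical)
```

If the sample were normal, the pairs would lie close to the line y = x. Points that rise above the line in the upper tail are the ones that contribute to the violation.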

Results: on all Q-Q plots below, we observe that the daily counts follow a normal distribution in the bulk of the distribution but depart from it in the tail, indicating that the data are skewed. Indeed, they are right-skewed (i.e. positively skewed).

Knowing that the data are right-skewed, we generate each time series’ Mean Residual Plot using the mrlplot() function from the extRemes package, which plots the mean excess against the candidate thresholds. These plots are used to choose the most adequate thresholds: the value of \(u\) from which the plot becomes approximately linear can generally be selected as the threshold.
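The quantity shown on a mean residual plot can be sketched in a few lines of Python. This is an illustration of the mean excess only, not a reimplementation of mrlplot(), which also draws confidence bands. The function names and the sample values are assumptions of the example.

```python
def mean_excess(values, threshold):
    """Average of (x - u) over the observations x that exceed the threshold u."""
    excesses = [x - threshold for x in values if x > threshold]
    if not excesses:
        return None  # no observation lies above this threshold
    return sum(excesses) / len(excesses)


def mean_residual_life(values, thresholds):
    """Return (u, mean excess at u) for every candidate threshold u."""
    return [(u, mean_excess(values, u)) for u in thresholds]


# Hypothetical series (not data from the report).
series = [5, 12, 18, 25, 40, 55, 70, 90, 140, 300]
curve = mean_residual_life(series, [0, 20, 50, 100])
for u, excess in curve:
    print(u, excess)
```

Plotting the second element of each pair against the first gives the curve whose approximately linear part is used to pick \(u\).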

After setting the threshold for each series, we draw its Peaks-Over-Threshold Plot with the threshold \(u\) to display the exceedances. Note that if the threshold is too low, the extreme value approximation is not valid, and if it is too high, too few exceedances remain to obtain insightful results.
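The trade-off described above can be illustrated with a short Python sketch that extracts the exceedances and reports how many remain for several thresholds. The function names, the sample values, and the candidate thresholds are assumptions of the example, not values from the report.

```python
def peaks_over_threshold(values, threshold):
    """Return (day index, value) pairs for observations strictly above u."""
    return [(i, x) for i, x in enumerate(values) if x > threshold]


def exceedance_rate(values, threshold):
    """Share of observations that exceed the threshold."""
    return len(peaks_over_threshold(values, threshold)) / len(values)


# Hypothetical series (not data from the report).
series = [3, 8, 150, 12, 7, 90, 4, 310, 6, 11, 45, 9]

for u in (5, 40, 200):
    exceedances = peaks_over_threshold(series, u)
    print(u, len(exceedances), round(exceedance_rate(series, u), 3))
```

With the lowest threshold almost every observation counts as an exceedance, so the selected points are not extreme. With the highest threshold a single point remains, which is too little to fit a model.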

Elizabeth II

Distribution

Mean Residual Plot

The selected threshold \(u\) is 125.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

United States

Distribution

Mean Residual Plot

The selected threshold \(u\) is 35.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.

Queen Victoria

Distribution

Mean Residual Plot

The selected threshold \(u\) is 100.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

World War II

Distribution

Mean Residual Plot

The selected threshold \(u\) is 30.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

World War I

Distribution

Mean Residual Plot

The selected threshold \(u\) is 25.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

George VI

Distribution

Mean Residual Plot

The selected threshold \(u\) is 200.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.

United Kingdom

Distribution

Mean Residual Plot

The selected threshold \(u\) is 50.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.

Princess Margaret, Countess of Snowdon

Distribution

Mean Residual Plot

The selected threshold \(u\) is 700.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.

Prince Philip, Duke of Edinburgh

Distribution

Mean Residual Plot

The selected threshold \(u\) is 400.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.

Winston Churchill

Distribution

Mean Residual Plot

The selected threshold \(u\) is 100.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.

Diana, Princess of Wales

Distribution

Mean Residual Plot

The selected threshold \(u\) is 70.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are not too few nor too many data points exceeding the threshold. Therefore, we assume that using the POT model is suitable.

2016 Summer Olympics

Distribution

Mean Residual Plot

The selected threshold \(u\) is 60.

Peaks-Over-Threshold Plot

The chosen threshold indicates that there are too few data points exceeding the threshold. Therefore, we believe that the POT model may not be suitable.

99%-Quantile

  • Estimate the 99%-quantile of each series and give a corresponding measure of uncertainty, i.e. provide an interval of values within which the true value of the quantile is believed to lie with a stated probability (MSE? with confidence interval?)

Series                                  quantile99  uncertainty
2016_Summer_Olympics                      143.5829     179.0603
Diana,_Princess_of_Wales                  363.1843     492.1521
Elizabeth_II                              642.9873    1753.4239
George_VI                                 520.8474   10635.0359
Prince_Philip,_Duke_of_Edinburgh          1111.279     1579.921
Princess_Margaret,_Countess_of_Snowdon    973.8326   17545.2920
Queen_Victoria                            421.7058     544.1161
United_Kingdom                            79.08561   1784.17935
United_States                             62.24281    266.52798
Winston_Churchill                         846.2276    7139.2646
World_War_I                               94.64611     77.44259
World_War_II                              61.75727     48.12695
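One possible way to attach an interval to a 99%-quantile is a percentile bootstrap, sketched below in Python. This is an assumption made for illustration: the report does not state how its quantile99 and uncertainty values were obtained, and this sketch is not that method. The function names, the simulated series, the number of resamples, and the 95% level are all hypothetical. Resampling days independently also ignores any serial dependence in the series.

```python
import random


def empirical_quantile(values, prob):
    """Quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    position = prob * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def bootstrap_quantile_interval(values, prob=0.99, level=0.95,
                                draws=1000, seed=0):
    """Percentile bootstrap interval for the prob-quantile of values."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(draws):
        # Resample the days with replacement and re-estimate the quantile.
        resample = [rng.choice(values) for _ in range(n)]
        estimates.append(empirical_quantile(resample, prob))
    alpha = (1 - level) / 2
    return (empirical_quantile(estimates, alpha),
            empirical_quantile(estimates, 1 - alpha))


# Hypothetical right-skewed daily series (simulated, not data from the report).
simulator = random.Random(1)
series = [simulator.expovariate(1 / 50) for _ in range(365)]

estimate = empirical_quantile(series, 0.99)
low, high = bootstrap_quantile_interval(series)
print(round(estimate, 1), (round(low, 1), round(high, 1)))
```

With only a few observations above the 99% level, the interval tends to be wide, which is the usual difficulty with high empirical quantiles.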

Detecting Simultaneous High Load

Graphical Method

The graphical method that we suggest for detecting simultaneous high traffic loads is an interactive Block Maxima plot colored by topic. It shows the daily maxima across the 12 web pages. If, on one or several days, a high maximum is displayed together with values from other web pages lying just below it, this indicates a simultaneous high load.

Pointing the mouse cursor over a data point displays more information. With this plot, we can observe that Princess Margaret, George VI, and Prince Philip, Duke of Edinburgh have the largest simultaneous high loads around 2016 November 6, and another, smaller simultaneous high load on 2016 April 21. The idea for Wikipedia is therefore that if the load of one of these web pages is increasing, the others should be cached to prevent exhausting the available resources.
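The reading of the plot described above can be approximated numerically. The Python sketch below finds, for each day, the page with the largest value and the other pages lying just below it. It is an illustration only and not the interactive plot of the report. The function name, the closeness ratio, the floor under which a day is ignored, and the three short series are hypothetical.

```python
def simultaneous_high_days(series_by_page, closeness=0.8, floor=0.0):
    """List days where other pages come close to the daily maximum.

    closeness: a page counts as "just below" when its value is at least
               this fraction of the daily maximum (hypothetical parameter).
    floor:     days whose maximum does not exceed this value are ignored.
    """
    pages = list(series_by_page)
    n_days = len(series_by_page[pages[0]])
    flagged = []
    for day in range(n_days):
        values = {page: series_by_page[page][day] for page in pages}
        top_page = max(values, key=values.get)
        top_value = values[top_page]
        if top_value <= floor:
            continue
        close = [page for page in pages
                 if page != top_page and values[page] >= closeness * top_value]
        if close:
            flagged.append((day, top_page, close))
    return flagged


# Hypothetical series for three pages over five days (not data from the report).
traffic = {
    "A": [10, 12, 500, 11, 9],
    "B": [8, 9, 450, 10, 300],
    "C": [5, 6, 7, 5, 6],
}
print(simultaneous_high_days(traffic, closeness=0.8, floor=100))
# [(2, 'A', ['B'])]
```

Day 2 is flagged because pages A and B peak together. Day 4 is not flagged because page B peaks alone.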

Numerical Method

For the numerical method, we compute a matrix of the tail dependence coefficients between the different web pages. We are interested in the values closest to 1: they indicate that if an extreme value arises in the upper tail for one web page, the other web page is very likely to display an extreme value in its upper tail as well. In that case there is tail dependence, and both (or more than two) pages will probably show a simultaneous high load. Wikipedia must then anticipate and use the caching system on them. In the matrix below, the largest off-diagonal values are found among Elizabeth II, George VI, Prince Philip and Princess Margaret (0.46 to 0.61), followed by World War I and World War II (0.43).

Page                                              (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)      (10)      (11)      (12)
(1) 2016_Summer_Olympics                    1.0000000 0.0714286 0.0000000 0.0714286 0.0357143 0.1071429 0.0357143 0.1071429 0.2142857 0.0357143 0.0000000 0.0357143
(2) Diana,_Princess_of_Wales                0.0714286 1.0000000 0.0714286 0.0714286 0.1428571 0.0714286 0.0714286 0.0357143 0.0357143 0.0000000 0.0000000 0.0357143
(3) Elizabeth_II                            0.0000000 0.0714286 1.0000000 0.4642857 0.6071429 0.4642857 0.2857143 0.1071429 0.0000000 0.0357143 0.1428571 0.0000000
(4) George_VI                               0.0714286 0.0714286 0.4642857 1.0000000 0.5357143 0.5714286 0.3214286 0.0714286 0.0357143 0.0714286 0.1071429 0.0357143
(5) Prince_Philip,_Duke_of_Edinburgh        0.0357143 0.1428571 0.6071429 0.5357143 1.0000000 0.5357143 0.2142857 0.0714286 0.0000000 0.0357143 0.1071429 0.0357143
(6) Princess_Margaret,_Countess_of_Snowdon  0.1071429 0.0714286 0.4642857 0.5714286 0.5357143 1.0000000 0.2142857 0.0714286 0.0714286 0.0000000 0.1071429 0.0357143
(7) Queen_Victoria                          0.0357143 0.0714286 0.2857143 0.3214286 0.2142857 0.2142857 1.0000000 0.0357143 0.0000000 0.0000000 0.1071429 0.0000000
(8) United_Kingdom                          0.1071429 0.0357143 0.1071429 0.0714286 0.0714286 0.0714286 0.0357143 1.0000000 0.0357143 0.1785714 0.1071429 0.1071429
(9) United_States                           0.2142857 0.0357143 0.0000000 0.0357143 0.0000000 0.0714286 0.0000000 0.0357143 1.0000000 0.0714286 0.1785714 0.1785714
(10) Winston_Churchill                      0.0357143 0.0000000 0.0357143 0.0714286 0.0357143 0.0000000 0.0000000 0.1785714 0.0714286 1.0000000 0.1071429 0.0714286
(11) World_War_I                            0.0000000 0.0000000 0.1428571 0.1071429 0.1071429 0.1071429 0.1071429 0.1071429 0.1785714 0.1071429 1.0000000 0.4285714
(12) World_War_II                           0.0357143 0.0357143 0.0000000 0.0357143 0.0357143 0.0357143 0.0000000 0.1071429 0.1785714 0.0714286 0.4285714 1.0000000
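A simple count-based estimate of a tail dependence coefficient can be sketched in Python as follows. This is an assumption made for illustration: the report does not state which estimator produced the matrix above, although its entries are all multiples of 1/28, which is consistent with a count over a fixed number of extreme days. The function names, the choice of k, and the three short series are hypothetical.

```python
def top_days(values, k):
    """Indices of the k largest observations (ties go to the earlier day)."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    return set(order[:k])


def tail_dependence(x, y, k):
    """Share of the k most extreme days of x that are also extreme for y."""
    return len(top_days(x, k) & top_days(y, k)) / k


def tail_dependence_matrix(series_by_page, k):
    """Pairwise coefficients as a nested dictionary page -> page -> value."""
    pages = list(series_by_page)
    return {a: {b: tail_dependence(series_by_page[a], series_by_page[b], k)
                for b in pages}
            for a in pages}


# Hypothetical series for three pages over eight days (not data from the report).
traffic = {
    "a": [1, 9, 2, 8, 3, 7, 1, 2],
    "b": [2, 8, 1, 1, 2, 9, 7, 1],
    "c": [9, 1, 8, 2, 7, 1, 2, 3],
}
matrix = tail_dependence_matrix(traffic, k=3)
for page, row in matrix.items():
    print(page, [round(value, 3) for value in row.values()])
```

Pages a and b share two of their three most extreme days, so their coefficient is 2/3, while page c peaks on different days and gets 0 with both. Pairs with a coefficient close to 1 are the candidates for joint caching.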